ABSTRACT

Viral integration plays an important role in the
development of malignant diseases. Viruses differ
in preferred integration site and flanking sequence.
Viral integration sites (VIS) have been found
next to oncogenes and common fragile sites.
Understanding the typical DNA features near VIS is
useful for the identification of potential oncogenes,
prediction of malignant disease development and
assessment of the probability of malignant
transformation in gene therapy. Therefore, we have built a
database of human disease-related VIS (Dr.VIS,
http://www.scbit.org/dbmi/drvis) to collect and
maintain human disease-related VIS data, including
characteristics of the malignant disease, chromosome
region, genomic position and viral-host
junction sequence. The current build of Dr.VIS
covers about 600 natural VIS of 5 oncogenic
viruses representing 11 diseases. Among them,
about 200 VIS have viral-host junction sequences.



INTRODUCTION 

The contribution of infectious agents to the development
of serious human diseases, especially tumors, is
increasingly understood (1). It is estimated that viral infections
contribute to 15-20% of all human cancers (2). Research
has revealed that integration of viral genomes into human
chromosomes, which can activate or inactivate host genes
by means of provirus insertion, is necessary for most
virus-induced tumor development (2,3). This
holds not only for retroviruses such as human T-cell
leukemia virus (4), but also for a number of
non-retroviruses such as human papillomavirus (5) and
hepatitis B virus (2,6). In addition, integration events can
cause rearrangements of viral and host sequences (7),
expression of fused transcripts, deletions of chromosomal
sequences and transpositions of viral sequences from one
chromosome to another (8-10). Viral integration is
site-specific in many cases (11). Moreover, viruses differ
in their preferred insertion site (12). Viral integration sites
(VIS) have become a key to associating viral infection with
human malignant disease.

To date, at least seven viruses have been compellingly
associated with human malignant diseases,
including:

(1) HTLV-1 (adult T-cell leukemia and tropical spastic 
paraparesis) (13); 

(2) HPV (cervical cancer, head and neck cancer and
anogenital cancer) (14,15);

(3) HHV-8 (Kaposi's sarcoma) (16); 

(4) EBV (Burkitt's lymphoma) (17); 

(5) HBV (hepatocellular carcinoma) (18); 

(6) MCV, Merkel cell polyomavirus (Merkel cell
carcinoma) (19); and

(7) HIV (AIDS and B-cell lymphoma) (1). 

Many other viruses are potentially associated
with human malignant diseases, such as Simian virus 40
(brain cancer, bone cancer and mesothelioma) and BK virus
(prostate cancer) (1-3). Some are still under
study, such as xenotropic murine leukemia virus-related
virus, whose relationship with prostate cancer remains
controversial (20-22). Most of those viruses have a
significant integration step in viral infection and disease
development.

Viral integration can activate gene expression to cause 
malignant disease if the VIS is close to an oncogene. This 
process, known as insertional mutagenesis (23), has
allowed identification of potential cellular oncogenes 
through mapping of retroviral integration sites (23,24). 
This work has also led to the development of a database 
of cancer-associated genes (23,25). 

Gene therapy holds promise for curing many malignant
diseases. However, current gene therapy methods have
limited control over where a therapeutic virus inserts
into the human genome. It was reported that several
patients developed T-cell leukemia during treatment of
X-linked severe combined immunodeficiency (SCID-X1),
because of viral integration near the proto-oncogenes
LMO2, BMI1 and CCND2 (23,26).

Therefore, understanding the genes and DNA features
near disease-related VIS will aid the identification of
potential oncogenes, prediction of malignant disease
development and assessment of the probability of malignant
transformation in gene therapy. However, the numerous
identified VIS are still widely scattered in published
papers. In this study, we developed a database of human
disease-related VIS (Dr.VIS) to collect and maintain those
data from the literature (PubMed) and public databases
(GenBank) (27). Furthermore, each VIS is linked to the
UCSC Genome Browser (28) and Ensembl Genome
Browser (29) for more detailed viewing of genomic traits.

MATERIALS AND METHODS 

Data model of VIS and clusters 

The following characteristics are listed for each human
disease-related VIS: virus name, chromosome region,
locus, genomic position, viral-host junction sequence
and corresponding human disease. The chromosome
region is denoted as cytogenetic band. The locus must
have been approved by HGNC (30) and can be a
microRNA or an interrupted gene with specific
coordinates of subcomponents (exons or introns). Genomic
position is the position of the insertion point in the
genome as represented in the Human Genome Assembly
2009 (hg19) (31). Viral-host junction sequence is always
recorded as 'human genome-viral genome-human
genome'.

Table 1. Confidence codes

Code   Description          Integration sites count (f)
WK     Well known           f > 5
SS     Strongly supported   1 < f < 5
SO     Single observation   f = 1

In Dr.VIS, VIS sharing the same virus name,
chromosome region and human disease are clustered to
generate a unique data entry called a viral integration
cluster (or VIS cluster) for convenient data organization.
Genomic traits of a VIS cluster include common fragile
site (32), microRNA, gene distribution and so on. More
detailed traits are cross-linked to HGNC (30), UCSC (33)
and Ensembl (29) through their chromosome coordinates.
Furthermore, each VIS cluster is assigned a confidence
code (Table 1) to indicate its frequency.
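
The data model and the clustering rule described above can be sketched as follows. This is a minimal Python illustration and not the Dr.VIS implementation: the class name VIS, its field names and the function cluster_vis are hypothetical, and the example records are invented placeholders rather than database entries.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class VIS:
    """One viral integration site; all field names are assumptions."""
    virus: str
    chromosome_region: str                    # cytogenetic band
    disease: str
    locus: Optional[str] = None               # HGNC-approved symbol, if any
    genomic_position: Optional[int] = None    # hg19 coordinate
    junction_sequence: Optional[str] = None   # human-viral-human


ClusterKey = Tuple[str, str, str]


def cluster_vis(sites: List[VIS]) -> Dict[ClusterKey, List[VIS]]:
    """Group VIS that share virus name, chromosome region and disease."""
    clusters: Dict[ClusterKey, List[VIS]] = defaultdict(list)
    for site in sites:
        key = (site.virus, site.chromosome_region, site.disease)
        clusters[key].append(site)
    return dict(clusters)


# Invented placeholder records, for illustration only.
example_sites = [
    VIS("virus A", "band 1", "disease X", genomic_position=100),
    VIS("virus A", "band 1", "disease X", genomic_position=250),
    VIS("virus B", "band 2", "disease Y"),
]
example_clusters = cluster_vis(example_sites)
```

Keying the cluster on the three shared attributes keeps every individual site available inside its cluster, so that the number of sites per cluster can later be used to assign a confidence code.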

Collection of VIS associated with human diseases 

VIS related to human disease were collected from PubMed
and GenBank (Figure 1). All VIS deposited in Dr.VIS were
sequenced or detected from natural samples of patients. A
Perl script extracted viral-host junction sequences from
GenBank by matching keywords (i.e. integration site)
and annotation of both host and virus (i.e. Homo
sapiens and a virus) as regular expressions. The script also
extracted the PMIDs of the original papers reporting
junction sequences, for subsequent manual retrieval from
PubMed and curation.
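
The keyword matching described above can be illustrated with a short sketch. The original work used a Perl script that is not reproduced here; the Python version below is only an approximation under stated assumptions. The regular expressions, the function names is_candidate and extract_pmids, and the example record (including its all-zero PMID) are hypothetical. The sketch relies only on the fact that GenBank flat-file records list literature references on PUBMED lines.

```python
import re
from typing import List

# Assumed patterns: keyword, host annotation and virus annotation.
KEYWORD_RE = re.compile(r"integration\s+site", re.IGNORECASE)
HOST_RE = re.compile(r"Homo\s+sapiens", re.IGNORECASE)
VIRUS_RE = re.compile(r"virus", re.IGNORECASE)
# PUBMED lines in the REFERENCE blocks of a GenBank flat-file record.
PUBMED_RE = re.compile(r"^\s*PUBMED\s+(\d+)", re.MULTILINE)


def is_candidate(record: str) -> bool:
    """True if the record mentions the keyword, the host and a virus."""
    return all(p.search(record) for p in (KEYWORD_RE, HOST_RE, VIRUS_RE))


def extract_pmids(record: str) -> List[str]:
    """Return the PMIDs listed in the record, in order of appearance."""
    return PUBMED_RE.findall(record)


# Invented record fragment; the text and the PMID are placeholders.
example_record = """\
DEFINITION  Homo sapiens/Hepatitis B virus integration site, junction.
REFERENCE   1  (bases 1 to 120)
   PUBMED   00000000
"""

if is_candidate(example_record):
    pmids_for_curation = extract_pmids(example_record)
```

Records that pass such a filter would still require manual review, as in the curation step described in the text.
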

Figure 1. Work flow of data collection and re-mapping.

Papers reporting disease-related viral integration into
the human genome were collected from PubMed in two
ways: by script, as described immediately above, and by
manual search of the keywords virus, integration site,
cancer, tumor, malignancy and disease. About 200
initially selected papers were obtained and filtered for
relevance; curators read the nearly 80 finally selected papers in
full to extract the VIS characteristics required in the
data model. In some cases, exact junctions were
transcribed from illustrations in the papers. Sequences
denoted with accession numbers were downloaded directly
from GenBank.

Re-mapping of VIS 

Three fields of a VIS (genomic position, chromosome 
region and locus) are updated by re-mapping according 
to the viral-host junction sequence obtained (Figure 1). 

Mapping of genomic position. The genomic position of a
VIS in the Human Genome Assembly 2009 (hg19) (31) is
identified using BLAT from UCSC (33), provided that the
identity of the BLAT result exceeds 80%. When there are
two or more positive alignments, a manual check helps to
choose the correct one.
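
The filtering rule above can be sketched as follows. This is an illustrative Python fragment, not the pipeline used for Dr.VIS. The class BlatHit and the function select_hits are hypothetical, and percent identity is approximated here as matches / (matches + mismatches), which is a simplification of the identity value that BLAT itself reports. The example hits are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlatHit:
    """Minimal stand-in for one BLAT alignment (assumed fields)."""
    chrom: str
    start: int
    end: int
    matches: int
    mismatches: int

    def identity(self) -> float:
        """Simplified percent identity; an assumption, see lead-in."""
        aligned = self.matches + self.mismatches
        return 100.0 * self.matches / aligned if aligned else 0.0


def select_hits(hits: List[BlatHit],
                min_identity: float = 80.0) -> Tuple[List[BlatHit], bool]:
    """Keep hits whose identity exceeds the threshold, best first.

    The second return value is True when two or more hits pass,
    signalling that a manual check is needed to choose one.
    """
    passing = [h for h in hits if h.identity() > min_identity]
    passing.sort(key=lambda h: h.identity(), reverse=True)
    return passing, len(passing) > 1


# Invented hits, for illustration only.
example_hits = [
    BlatHit("chr1", 1000, 1100, matches=95, mismatches=5),
    BlatHit("chr2", 500, 600, matches=70, mismatches=30),
]
passing, needs_manual_check = select_hits(example_hits)
```

Returning the flag separately from the hit list keeps the automated step from silently picking one of several plausible alignments.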

Mapping of locus. The locus of integration is always
interrupted, and potentially inactivated, by viral insertion. Loci
were identified using the Genes and Gene Tracks Table
from UCSC (34), and VIS were mapped to the gene
component (exon, intron, 3'-untranslated region, promoter) on
the basis of the BLAT hit. All recognized loci were required to
have been approved by the HGNC (30).
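
Assigning an insertion point to a gene component amounts to an interval lookup. The sketch below shows one way this could be done in Python; it is an assumption about the logic, not the code used for Dr.VIS. The function locate_component, the half-open interval convention and the toy gene model with its coordinates are all hypothetical.

```python
from typing import List, Optional, Tuple

# (start, end, label) with half-open coordinates: start <= pos < end.
Component = Tuple[int, int, str]


def locate_component(position: int,
                     components: List[Component]) -> Optional[str]:
    """Return the label of the component containing the position."""
    for start, end, label in components:
        if start <= position < end:
            return label
    return None  # position lies outside the annotated gene


# Toy gene model with invented coordinates, for illustration only.
toy_gene = [
    (100, 200, "promoter"),
    (200, 350, "exon 1"),
    (350, 900, "intron 1"),
    (900, 1000, "exon 2"),
    (1000, 1100, "3'-untranslated region"),
]
component = locate_component(400, toy_gene)
```

A linear scan is sufficient for the handful of components in a single gene; a genome-wide lookup would more likely use a sorted or indexed table.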

Mapping of chromosome region. The chromosome region
(cytogenetic band) was subsequently calculated based on
the insertion site's genomic position and the Chromosome
Band Table from UCSC (34).
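
The band calculation can be illustrated with a binary search over a table of band intervals. The following Python sketch assumes a table of (start, end, band name) rows per chromosome, sorted by start. The function cytoband is hypothetical, and the two example rows are illustrative values that should be checked against the actual Chromosome Band Table before any use.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Band = Tuple[int, int, str]  # start, end (half-open), band name


def cytoband(chrom: str, position: int,
             table: Dict[str, List[Band]]) -> Optional[str]:
    """Return a band label such as '1p36.32', or None if not found.

    Assumes each chromosome's bands are sorted by start and do not overlap.
    """
    bands = table.get(chrom, [])
    starts = [b[0] for b in bands]
    i = bisect_right(starts, position) - 1
    if i >= 0 and bands[i][0] <= position < bands[i][1]:
        name = chrom[3:] if chrom.startswith("chr") else chrom
        return name + bands[i][2]
    return None


# Illustrative rows only; verify against the real band table.
example_table = {
    "chr1": [
        (0, 2300000, "p36.33"),
        (2300000, 5400000, "p36.32"),
    ],
}
band = cytoband("chr1", 3000000, example_table)
```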

Clustering of VIS 

As described in the data model, VIS are conditionally 
clustered as a unique data entry termed viral integration 
cluster (VIS cluster). A confidence code is assigned to each 
VIS cluster indicating its frequency, according to the 
number of insertion sites that it contains (Table 1). 
Statistics of integration clusters compellingly associated 
with human malignant disease are illustrated for the 
current build in Figure 2. 
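
The assignment of confidence codes from cluster size can be written as a small function. This Python sketch is hypothetical and rests on one labeled assumption: Table 1 as extracted does not make clear which code applies to a cluster of exactly five sites, so the sketch treats five or more sites as WK.

```python
def confidence_code(site_count: int) -> str:
    """Map the number of VIS in a cluster to a confidence code.

    SO: single observation (1 site)
    SS: strongly supported (2 to 4 sites)
    WK: well known (5 or more sites; the boundary at 5 is an assumption)
    """
    if site_count < 1:
        raise ValueError("a cluster must contain at least one VIS")
    if site_count == 1:
        return "SO"
    if site_count < 5:
        return "SS"
    return "WK"


codes = [confidence_code(n) for n in (1, 3, 7)]
```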

Web interfaces 

Data browser. The data browser presents a catalog of 
links to chromosome, virus and disease. Currently, there 
are 24 chromosomes, 12 viruses and 12 diseases, which can 
be browsed for VIS. 

Data search. Three search engines (keywords, position
and the jQuery search engine) are implemented in the
data interface. Users can search Dr.VIS with keywords
of disease, virus, chromosome region, and so on, using
the keyword search engine. VIS clusters can also be
selected on the basis of genomic position or chromosome
region (cytogenetic band). Users can filter the search result
through the jQuery search engine, which is embedded in the
table list and is powered by jQuery.

Data visualization. For each VIS cluster, Dr.VIS provides
an interface (Figure 3) with details and links to the UCSC
Genome Browser and the Ensembl Genome Browser. The
graphic view (Figure 4) summarizes the distribution of
VIS clusters over any human chromosome. Any or all of
the viruses can be selected for display.

Figure 3. Screenshot of the VIS details interface.

Figure 4. Screenshot of the graphic view of VIS located in human chromosome 1.
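
Links from a VIS to the two genome browsers can be built from its chromosome coordinates. The Python sketch below is an assumption about how such links might be generated and is not taken from Dr.VIS. The function names are hypothetical, and the URL patterns reflect the public UCSC hgTracks interface and the Ensembl GRCh37 archive site as commonly used; they should be verified against the browsers' own linking documentation.

```python
from urllib.parse import urlencode


def ucsc_link(chrom: str, start: int, end: int, db: str = "hg19") -> str:
    """Build a UCSC Genome Browser link for a region (assumed pattern)."""
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"})
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


def ensembl_link(chrom: str, start: int, end: int) -> str:
    """Build an Ensembl GRCh37 location link (assumed pattern)."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    query = urlencode({"r": f"{name}:{start}-{end}"})
    return "https://grch37.ensembl.org/Homo_sapiens/Location/View?" + query


# Invented coordinates, for illustration only.
ucsc_url = ucsc_link("chr1", 1000, 2000)
ensembl_url = ensembl_link("chr1", 1000, 2000)
```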

DISCUSSION 

VIS associated with malignant disease were always
detected in samples from patients. Many useful
approaches have been applied or newly developed to
identify VIS, such as fluorescence in situ hybridization
(FISH), linear amplification-mediated PCR (LAM-PCR)
(35), amplification of papillomavirus oncogene
transcripts assay (APOT), detection of integrated papilloma
sequences PCR (DIPS-PCR) and next-generation
sequencing (36-38). In addition to VIS directly detected
in naturally infected samples, many integration sites have
been identified in artificial experiments or in silico (39),
as with SeqMap (23). Dr.VIS was developed as a
comprehensive database of VIS associated with human malignant
diseases. Dr.VIS is intended to facilitate biomedical
applications or systematic research into molecular causation
and anomalies. The current build focuses on oncogenic
viruses demonstrably associated with human cancers.
Viruses potentially resulting in anomalies are also of
great interest. Updates of Dr.VIS will be continuously
supported, since causative viruses continue to be identified
and the number of documented VIS is rapidly increasing.